Downloading Texts from Project Gutenberg using R

Author

Martin Schweinberger

Introduction

This how-to guide shows how to download, inspect, and clean texts from the Project Gutenberg archive using R. Project Gutenberg is one of the oldest and largest freely available digital libraries, containing over 70,000 ebooks whose US copyright has expired. It is an invaluable resource for researchers in literary studies, corpus linguistics, computational humanities, and any field requiring access to large amounts of digitised historical and literary text.

The R package gutenbergr provides convenient programmatic access to the Project Gutenberg catalogue, allowing you to search, filter, and download texts directly into your R session without manual downloading or file management.

Before You Start

This guide assumes basic familiarity with R. If you are new to R, please work through an introductory R tutorial first.

What This Guide Covers
  1. Setup — installing and loading required packages
  2. A robust download function — handling mirror failures automatically
  3. Exploring the catalogue — browsing and searching available texts
  4. Filtering by author, language, subject, and rights
  5. Downloading individual texts
  6. Downloading multiple texts simultaneously
  7. Cleaning and preparing downloaded texts — removing boilerplate, splitting into sections, and saving for analysis
  8. Troubleshooting — encoding issues and texts not found

Citation

Schweinberger, Martin. 2026. Downloading Texts from Project Gutenberg using R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/gutenberg/gutenberg.html (Version 2026.02.24).


Setup

Section Overview

What you’ll learn: How to install and load the packages needed for this guide

Installing Packages

Code
# Install required packages — run once, then comment out
install.packages("gutenbergr")   # access to Project Gutenberg catalogue and downloads
install.packages("dplyr")        # data manipulation (filter, select, mutate)
install.packages("stringr")      # string processing (cleaning text)
install.packages("tidyr")        # reshaping data
install.packages("purrr")        # iteration (downloading multiple texts)
install.packages("ggplot2")      # visualisation
install.packages("flextable")    # formatted tables
install.packages("DT")           # interactive data tables
install.packages("here")         # portable file paths

Loading Packages

Code
# Load packages — run at the start of every session
library(gutenbergr)   # Project Gutenberg interface
library(dplyr)        # data manipulation
library(stringr)      # string processing
library(tidyr)        # data reshaping
library(purrr)        # iteration over multiple IDs
library(ggplot2)      # plotting
library(flextable)    # formatted tables
library(DT)           # interactive HTML tables
library(here)         # portable file paths
Why Not library(tidyverse)?

Loading individual packages (dplyr, stringr, etc.) is preferable to library(tidyverse) for reproducibility: it makes dependencies explicit, avoids namespace conflicts, and ensures your code works even if the Tidyverse bundle changes. LADAL tutorials follow this best practice throughout.
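As a concrete illustration of the namespace point, base R and dplyr both define a function called filter(); the explicit prefix removes any ambiguity about which one runs. A minimal sketch using the built-in mtcars data:

```r
# stats::filter() (base R) and dplyr::filter() share a name; whichever
# package is attached last masks the other. Explicit prefixes are unambiguous:
library(dplyr)
dplyr::filter(mtcars, cyl == 6)    # row filtering from dplyr
stats::filter(1:10, rep(1/3, 3))   # moving-average filter from base R
```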


A Robust Download Function

Section Overview

What you’ll learn: Why direct gutenberg_download() calls sometimes return empty results, and how to define a single reliable helper function that all subsequent downloads use

Project Gutenberg’s servers and mirrors can be unreliable — a direct gutenberg_download() call may silently return zero lines even when the ID is correct. The most robust approach is to:

  1. Try several mirrors in sequence via gutenbergr
  2. Fall back to reading the raw plain-text file directly from the Project Gutenberg cache URL, which is always at https://www.gutenberg.org/cache/epub/{ID}/pg{ID}.txt

We define this logic once as a helper function and use it throughout the guide:

Code
# Helper function: download a single text by Gutenberg ID
# Tries gutenbergr mirrors first; falls back to direct URL read if all fail
# Arguments:
#   id          : integer gutenberg_id
#   meta_fields : character vector of metadata columns to attach (passed to gutenberg_download)
#   title_fallback : title string to use in the fallback data frame
gutenberg_safe <- function(id, meta_fields = "title", title_fallback = NA_character_) {

  # List of mirrors to try in order
  mirrors <- c(
    "http://mirrors.xmission.com/gutenberg/",
    "http://gutenberg.pglaf.org/",
    "https://gutenberg.readingroo.ms/",
    "http://gutenberg.nabasny.com/"
  )

  result <- NULL

  # Step 1: try each mirror via gutenbergr
  for (m in mirrors) {
    tryCatch({
      dl <- gutenberg_download(id, meta_fields = meta_fields, mirror = m)
      if (!is.null(dl) && nrow(dl) > 0) {
        message("Downloaded ID ", id, " via mirror: ", m)
        result <- dl
        break
      }
    }, error   = function(e) NULL,
       warning = function(w) NULL)
  }

  # Step 2: fall back to direct cache URL if all mirrors failed
  if (is.null(result) || nrow(result) == 0) {
    message("All mirrors failed for ID ", id, " — trying direct cache URL")
    cache_url <- paste0("https://www.gutenberg.org/cache/epub/", id, "/pg", id, ".txt")
    tryCatch({
      lines <- readLines(url(cache_url), warn = FALSE, encoding = "UTF-8")
      # Look up title from metadata if not supplied
      if (is.na(title_fallback)) {
        title_fallback <- gutenberg_metadata |>
          dplyr::filter(gutenberg_id == id) |>
          dplyr::pull(title) |>
          dplyr::first()
      }
      result <- data.frame(
        gutenberg_id = id,
        text         = lines,
        title        = title_fallback,
        stringsAsFactors = FALSE
      )
      message("Downloaded ID ", id, " via direct cache URL (", nrow(result), " lines)")
    }, error = function(e) {
      stop("Could not download ID ", id, ": ", conditionMessage(e))
    })
  }

  result
}
Why a Helper Function?

Defining gutenberg_safe() once and calling it throughout means:

  • Every download in this guide uses the same robust fallback logic
  • If Project Gutenberg updates its mirror list, you only need to update one place
  • The function is self-documenting — the mirrors and fallback URL are visible in one location
  • You can copy gutenberg_safe() directly into your own projects
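As a minimal usage sketch (assuming the function definition above has been run and an internet connection is available; ID 11 is Alice's Adventures in Wonderland):

```r
# Download a single text with the helper and inspect the result
alice <- gutenberg_safe(11, meta_fields = c("title", "author"))
head(alice$text)
nrow(alice)
```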

Exploring the Project Gutenberg Catalogue

Section Overview

What you’ll learn: How to browse and search the full Project Gutenberg catalogue, and what metadata fields are available for filtering

The Metadata Table

The gutenbergr package ships with a metadata table — gutenberg_metadata — that contains information about every text in the Project Gutenberg archive. You can inspect it directly without downloading anything:

Code
# Load the full metadata table
# This is a local data frame included with the gutenbergr package
overview <- gutenberg_metadata

# How many texts are available?
cat("Total texts in catalogue:", nrow(overview), "\n")
Total texts in catalogue: 72569 
Code
cat("Metadata columns:", ncol(overview), "\n")
Metadata columns: 8 
Code
cat("Column names:", paste(names(overview), collapse = ", "), "\n")
Column names: gutenberg_id, title, author, gutenberg_author_id, language, gutenberg_bookshelf, rights, has_text 

The metadata table contains the following key fields:

  • gutenberg_id: unique numeric identifier for each text
  • title: title of the work
  • author: author name in 'Surname, Firstname' format
  • gutenberg_author_id: unique identifier for the author (useful for finding all works by one author)
  • language: ISO 639 language code (e.g. 'en', 'de', 'fr')
  • gutenberg_bookshelf: thematic bookshelf category (e.g. 'Science Fiction', 'History')
  • rights: copyright status (typically 'Public domain in the USA.')
  • has_text: whether a plain text version is available for download
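The gutenberg_author_id field is handy when the exact author name formatting is uncertain: look the ID up once, then filter by it. A sketch (assumes gutenbergr and dplyr are loaded as above, and uses Jane Austen purely as an example):

```r
# Resolve an author's numeric ID from the catalogue, then fetch all works by ID
austen_id <- gutenberg_metadata |>
  dplyr::filter(author == "Austen, Jane") |>
  dplyr::pull(gutenberg_author_id) |>
  unique()

gutenberg_works(gutenberg_author_id == austen_id) |>
  dplyr::select(gutenberg_id, title)
```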

Browsing with gutenberg_works()

The gutenberg_works() function is a convenience wrapper around gutenberg_metadata that returns only public-domain texts with a downloadable plain text version (has_text == TRUE):

Code
# Browse all available public domain texts with plain text versions
all_works <- gutenberg_works()
cat("Texts available via gutenberg_works():", nrow(all_works), "\n")

Filtering the Catalogue

Section Overview

What you’ll learn: How to filter the Project Gutenberg catalogue by author, language, subject/bookshelf, and multiple criteria to find exactly the texts you need

Filter by Author

Author names in the catalogue are stored in “Surname, Firstname” format:

Code
# Find all works by Charles Darwin using exact name format
darwin_works <- gutenberg_works(author == "Darwin, Charles")
cat("Works by Charles Darwin:", nrow(darwin_works), "\n")
Works by Charles Darwin: 31 

When unsure of the exact name format, use str_detect() for partial matching:

Code
# Partial name search — more robust than exact matching
austen_works <- gutenberg_works(
  stringr::str_detect(author, "Austen")
)
cat("Works matching 'Austen':", nrow(austen_works), "\n")
Works matching 'Austen': 16 
Code
austen_works |> dplyr::select(gutenberg_id, title, author)
# A tibble: 16 × 3
   gutenberg_id title                                                     author
          <int> <chr>                                                     <chr> 
 1          105 "Persuasion"                                              Auste…
 2          121 "Northanger Abbey"                                        Auste…
 3          141 "Mansfield Park"                                          Auste…
 4          158 "Emma"                                                    Auste…
 5          946 "Lady Susan"                                              Auste…
 6         1212 "Love and Freindship [sic]"                               Auste…
 7         1342 "Pride and Prejudice"                                     Auste…
 8        17797 "Memoir of Jane Austen"                                   Auste…
 9        21839 "Sense and Sensibility"                                   Auste…
10        22536 "Jane Austen, Her Life and Letters: A Family Record"      Auste…
11        22536 "Jane Austen, Her Life and Letters: A Family Record"      Auste…
12        31100 "The Complete Project Gutenberg Works of Jane Austen\nA … Auste…
13        33513 "The Frightened Planet"                                   Auste…
14        37431 "Pride and Prejudice, a play founded on Jane Austen's no… Auste…
15        39897 "Discoveries Among the Ruins of Nineveh and Babylon"      Layar…
16        42078 "The Letters of Jane Austen\r\nSelected from the compila… Auste…

Filter by Language

The language field uses ISO 639-1 two-letter codes:

Code
# Count German-language texts available
gutenberg_works(
  languages     = "de",
  all_languages = TRUE
) |>
  dplyr::count(language, sort = TRUE)
# A tibble: 1 × 2
  language     n
  <chr>    <int>
1 de        1296
Common Language Codes
  en  English       de  German
  fr  French        it  Italian
  es  Spanish       nl  Dutch
  pt  Portuguese    la  Latin
  fi  Finnish       zh  Chinese

For a full list, see the ISO 639-1 standard.

Code
# Count texts per language across the full catalogue
lang_counts <- gutenberg_metadata |>
  dplyr::filter(has_text == TRUE) |>
  dplyr::count(language, sort = TRUE) |>
  dplyr::filter(!is.na(language)) |>
  head(15)
Code
lang_counts |>
  dplyr::mutate(language = reorder(language, n)) |>
  ggplot(aes(x = language, y = n)) +
  geom_col(fill = "steelblue", width = 0.7) +
  coord_flip() +
  labs(
    title    = "Project Gutenberg: Texts by Language",
    subtitle = "Top 15 languages (texts with downloadable plain text only)",
    x        = "Language (ISO 639-1)",
    y        = "Number of texts"
  ) +
  theme_bw() +
  theme(panel.grid.minor = element_blank())

Filter by Subject / Bookshelf

Project Gutenberg organises texts into thematic “bookshelves”:

Code
# Find all texts on the Science Fiction bookshelf
scifi <- gutenberg_works(
  stringr::str_detect(gutenberg_bookshelf, "Science Fiction")
)
cat("Science Fiction texts:", nrow(scifi), "\n")
Science Fiction texts: 1306 
Code
scifi |> dplyr::select(gutenberg_id, title, author) |> head(10)
# A tibble: 10 × 3
   gutenberg_id title                                       author              
          <int> <chr>                                       <chr>               
 1           36 The War of the Worlds                       Wells, H. G. (Herbe…
 2           42 The Strange Case of Dr. Jekyll and Mr. Hyde Stevenson, Robert L…
 3           62 A Princess of Mars                          Burroughs, Edgar Ri…
 4           64 The Gods of Mars                            Burroughs, Edgar Ri…
 5           68 The warlord of Mars                         Burroughs, Edgar Ri…
 6           72 Thuvia, Maid of Mars                        Burroughs, Edgar Ri…
 7           86 A Connecticut Yankee in King Arthur's Court Twain, Mark         
 8           96 The Monster Men                             Burroughs, Edgar Ri…
 9           97 Flatland: A Romance of Many Dimensions      Abbott, Edwin Abbott
10          123 At the Earth's Core                         Burroughs, Edgar Ri…
Code
# Browse the top 20 most populated bookshelves
gutenberg_metadata |>
  dplyr::filter(!is.na(gutenberg_bookshelf), has_text == TRUE) |>
  tidyr::separate_rows(gutenberg_bookshelf, sep = "/") |>
  dplyr::mutate(gutenberg_bookshelf = stringr::str_trim(gutenberg_bookshelf)) |>
  dplyr::count(gutenberg_bookshelf, sort = TRUE) |>
  head(20)
# A tibble: 20 × 2
   gutenberg_bookshelf                                        n
   <chr>                                                  <int>
 1 ""                                                     39230
 2 "Science Fiction"                                       1323
 3 "FR Littérature"                                         666
 4 "Children's Book Series"                                 494
 5 "Punch"                                                  493
 6 "Bestsellers, American, 1895-1923"                       394
 7 "World War I"                                            386
 8 "Historical Fiction"                                     340
 9 "US Civil War"                                           337
10 "Children's Fiction"                                     329
11 "Animal"                                                 303
12 "DE Prosa"                                               295
13 "Children's Literature"                                  269
14 "Technology"                                             236
15 "L'Illustration"                                         220
16 "Notes and Queries"                                      217
17 "The Mirror of Literature, Amusement, and Instruction"   202
18 "Christianity"                                           178
19 "Children's Picture Books"                               177
20 "United Kingdom"                                         170

Filter by Multiple Criteria

Combine conditions to narrow the catalogue precisely:

Code
# English-language science texts
english_science <- gutenberg_works(
  language == "en",
  stringr::str_detect(gutenberg_bookshelf, "(?i)science|natural|biology|astronomy")
)
cat("English science texts:", nrow(english_science), "\n")
English science texts: 1445 
Code
english_science |>
  dplyr::select(gutenberg_id, title, author, gutenberg_bookshelf) |>
  head(10)
# A tibble: 10 × 4
   gutenberg_id title                                 author gutenberg_bookshelf
          <int> <chr>                                 <chr>  <chr>              
 1           36 The War of the Worlds                 Wells… Movie Books/Scienc…
 2           42 The Strange Case of Dr. Jekyll and M… Steve… Precursors of Scie…
 3           62 A Princess of Mars                    Burro… Best Books Ever Li…
 4           64 The Gods of Mars                      Burro… Science Fiction    
 5           68 The warlord of Mars                   Burro… Science Fiction    
 6           72 Thuvia, Maid of Mars                  Burro… Science Fiction    
 7           86 A Connecticut Yankee in King Arthur'… Twain… Precursors of Scie…
 8           96 The Monster Men                       Burro… Science Fiction    
 9           97 Flatland: A Romance of Many Dimensio… Abbot… Science Fiction/Ma…
10          123 At the Earth's Core                   Burro… Science Fiction    

Downloading Individual Texts

Section Overview

What you’ll learn: How to download a single text by ID using gutenberg_safe(), and what the downloaded data looks like

Always Use the Gutenberg ID

Every text has a unique numeric ID visible in its Project Gutenberg URL (e.g., gutenberg.org/ebooks/1513). Downloading by ID is more reliable than searching by title, which can match multiple entries. Use gutenberg_works() or browse gutenberg.org to look up IDs before downloading.
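Looking up an ID from within R is straightforward; a sketch (any distinctive title fragment works, assuming gutenbergr, dplyr, and stringr are loaded as above):

```r
# Find the Gutenberg ID for a known title before downloading
gutenberg_works(stringr::str_detect(title, "Romeo and Juliet")) |>
  dplyr::select(gutenberg_id, title, author)
```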

Download Romeo and Juliet (ID: 1513)

Code
# Download Romeo and Juliet using gutenberg_safe()
# gutenberg_safe() tries multiple mirrors, then falls back to the direct cache URL
romeo <- gutenberg_safe(1513)

cat("Downloaded:", nrow(romeo), "lines\n")
Downloaded: 5647 lines
Code
cat("Columns:", paste(names(romeo), collapse = ", "), "\n")
Columns: gutenberg_id, text, title 

gutenberg_id   text                                                                      title
        1513   The Project Gutenberg eBook of Romeo and Juliet                           Romeo and Juliet
        1513                                                                             Romeo and Juliet
        1513   This ebook is for the use of anyone anywhere in the United States and     Romeo and Juliet
        1513   most other parts of the world at no cost and with almost no restrictions  Romeo and Juliet
        1513   whatsoever. You may copy it, give it away or re-use it under the terms    Romeo and Juliet
        1513   of the Project Gutenberg License included with this ebook or online       Romeo and Juliet
        1513   at www.gutenberg.org. If you are not located in the United States,        Romeo and Juliet
        1513   you will have to check the laws of the country where you are located      Romeo and Juliet
        1513   before using this eBook.                                                  Romeo and Juliet
        1513                                                                             Romeo and Juliet
        1513   Title: Romeo and Juliet                                                   Romeo and Juliet
        1513                                                                             Romeo and Juliet
        1513   Author: William Shakespeare                                               Romeo and Juliet
        1513                                                                             Romeo and Juliet
        1513   Release date: November 1, 1998 [eBook #1513]                              Romeo and Juliet

Download with Additional Metadata

The meta_fields argument attaches metadata columns to the downloaded text — useful when combining multiple texts into a corpus:

Code
# Download On the Origin of Species with title, author, and language attached
origin_species <- gutenberg_safe(
  1228,                                          # On the Origin of Species
  meta_fields = c("title", "author", "language")
)

cat("Title:",    unique(origin_species$title), "\n")
Title: On the Origin of Species By Means of Natural Selection
Or, the Preservation of Favoured Races in the Struggle for Life 
Code
cat("Author:",   unique(origin_species$author), "\n")
Author: 
Code
cat("Language:", unique(origin_species$language), "\n")
Language: 
Code
cat("Lines:",    nrow(origin_species), "\n")
Lines: 16570 

Author and language are empty here because this text came in via the direct cache URL fallback, which only attaches the title. To recover the missing metadata, join gutenberg_metadata back in by gutenberg_id (the multi-author corpus example below does exactly this).

Downloading Multiple Texts

Section Overview

What you’ll learn: How to download several texts at once and organise them into a labelled corpus ready for analysis

Downloading by ID Vector

To download multiple texts, call gutenberg_safe() for each ID and combine the results with dplyr::bind_rows():

Code
# Download Wuthering Heights (768) and Jane Eyre (1260)
# Call gutenberg_safe() for each ID, then stack the results
bronte_texts <- dplyr::bind_rows(
  gutenberg_safe(768),    # Wuthering Heights — Emily Brontë
  gutenberg_safe(1260)    # Jane Eyre — Charlotte Brontë
)

# How many lines from each text?
bronte_texts |>
  dplyr::count(title, name = "lines")
# A tibble: 2 × 2
  title                       lines
  <chr>                       <int>
1 Jane Eyre: An Autobiography 21381
2 Wuthering Heights           12342

Downloading All Works by an Author

Retrieve all IDs for an author from the catalogue, then loop through them:

Code
# Find all Charles Dickens IDs
dickens_ids <- gutenberg_works(
  author   == "Dickens, Charles",
  language == "en"
) |>
  dplyr::pull(gutenberg_id)

cat("Dickens texts available:", length(dickens_ids), "\n")
Dickens texts available: 54 
Code
cat("First 10 IDs:", paste(head(dickens_ids, 10), collapse = ", "), "\n")
First 10 IDs: 46, 564, 580, 699, 700, 730, 766, 821, 917, 963 
Code
# Download all Dickens texts — this may take several minutes
# purrr::map_dfr() loops over each ID and stacks the results
dickens_corpus <- purrr::map_dfr(
  dickens_ids,
  ~ gutenberg_safe(.x, meta_fields = c("title", "author"))
)

cat("Total lines:", nrow(dickens_corpus), "\n")
cat("Texts downloaded:", length(unique(dickens_corpus$title)), "\n")
Large Downloads

Downloading many texts at once can take several minutes. Best practices:

  • Save immediately after downloading (see the Saving section below) to avoid re-downloading
  • Download in batches if fetching more than ~20 texts
  • Be respectful of Project Gutenberg’s resources — it is a non-profit volunteer project
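The batching advice can be sketched as follows (illustrative file names; assumes gutenberg_safe() and dickens_ids defined above, plus the purrr and here packages):

```r
# Download in batches of 10, saving each batch before starting the next,
# so a failure part-way through never loses completed work
batches <- split(dickens_ids, ceiling(seq_along(dickens_ids) / 10))
for (i in seq_along(batches)) {
  batch <- purrr::map_dfr(batches[[i]],
                          ~ gutenberg_safe(.x, meta_fields = "title"))
  saveRDS(batch, here::here("data", paste0("dickens_batch_", i, ".rds")))
  message("Saved batch ", i, " of ", length(batches))
}
```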

Building a Multi-Author Corpus

Code
# Download three 19th-century texts for comparative analysis:
# Moby Dick (2701), Pride and Prejudice (1342), On the Origin of Species (1228)
comparison_corpus <- dplyr::bind_rows(
  gutenberg_safe(2701, meta_fields = c("title", "author")),  # Moby Dick
  gutenberg_safe(1342, meta_fields = c("title", "author")),  # Pride and Prejudice
  gutenberg_safe(1228, meta_fields = c("title", "author"))   # On the Origin of Species
) |>
  # If the author column is missing (fallback download), add it from metadata
  (\(df) {
    if (!"author" %in% names(df)) {
      df <- df |>
        dplyr::left_join(
          gutenberg_metadata |> dplyr::select(gutenberg_id, author),
          by = "gutenberg_id"
        )
    }
    df
  })()

# Corpus summary
comparison_corpus |>
  dplyr::group_by(author, title) |>
  dplyr::summarise(
    lines = dplyr::n(),
    words = sum(stringr::str_count(text, "\\S+"), na.rm = TRUE),
    .groups = "drop"
  )
# A tibble: 3 × 4
  author           title                                            lines  words
  <chr>            <chr>                                            <int>  <int>
1 Austen, Jane     "Pride and Prejudice"                            14911 130410
2 Darwin, Charles  "On the Origin of Species By Means of Natural S… 16570 158589
3 Melville, Herman "Moby Dick; Or, The Whale"                       22310 215840



Cleaning and Preparing Downloaded Texts

Section Overview

What you’ll learn: How to remove Project Gutenberg boilerplate, collapse lines into continuous text, split into chapters or acts, and save cleaned texts for analysis

Why this matters: Raw downloads include licence notices and formatting artefacts that distort frequency analysis, topic models, and other quantitative methods if not removed.

What Raw Downloads Look Like

Each download is a line-by-line data frame. The first and last portions contain boilerplate licence text:

Code
# Inspect the opening lines — boilerplate header visible here
head(romeo$text, 30)
 [1] "The Project Gutenberg eBook of Romeo and Juliet"                                      
 [2] "    "                                                                                 
 [3] "This ebook is for the use of anyone anywhere in the United States and"                
 [4] "most other parts of the world at no cost and with almost no restrictions"             
 [5] "whatsoever. You may copy it, give it away or re-use it under the terms"               
 [6] "of the Project Gutenberg License included with this ebook or online"                  
 [7] "at www.gutenberg.org. If you are not located in the United States,"                   
 [8] "you will have to check the laws of the country where you are located"                 
 [9] "before using this eBook."                                                             
[10] ""                                                                                     
[11] "Title: Romeo and Juliet"                                                              
[12] ""                                                                                     
[13] "Author: William Shakespeare"                                                          
[14] ""                                                                                     
[15] "Release date: November 1, 1998 [eBook #1513]"                                         
[16] "                Most recently updated: September 18, 2025"                            
[17] ""                                                                                     
[18] "Language: English"                                                                    
[19] ""                                                                                     
[20] "Credits: the PG Shakespeare Team, a team of about twenty Project Gutenberg volunteers"
[21] ""                                                                                     
[22] ""                                                                                     
[23] "*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***"                        
[24] ""                                                                                     
[25] ""                                                                                     
[26] ""                                                                                     
[27] ""                                                                                     
[28] "THE TRAGEDY OF ROMEO AND JULIET"                                                      
[29] ""                                                                                     
[30] "by William Shakespeare"                                                               
Code
# Inspect the closing lines — boilerplate footer visible here
tail(romeo$text, 20)
 [1] "Gutenberg™ concept of a library of electronic works that could be"    
 [2] "freely shared with anyone. For forty years, he produced and"          
 [3] "distributed Project Gutenberg™ eBooks with only a loose network of"   
 [4] "volunteer support."                                                   
 [5] ""                                                                     
 [6] "Project Gutenberg™ eBooks are often created from several printed"     
 [7] "editions, all of which are confirmed as not protected by copyright in"
 [8] "the U.S. unless a copyright notice is included. Thus, we do not"      
 [9] "necessarily keep eBooks in compliance with any particular paper"      
[10] "edition."                                                             
[11] ""                                                                     
[12] "Most people start at our website which has the main PG search"        
[13] "facility: www.gutenberg.org."                                         
[14] ""                                                                     
[15] "This website includes information about Project Gutenberg™,"          
[16] "including how to make donations to the Project Gutenberg Literary"    
[17] "Archive Foundation, how to help produce our new eBooks, and how to"   
[18] "subscribe to our email newsletter to hear about new eBooks."          
[19] ""                                                                     
[20] ""                                                                     

Removing Boilerplate

Project Gutenberg wraps every text in consistent *** START OF and *** END OF boundary markers. Note that gutenberg_download() strips this boilerplate automatically (strip = TRUE by default), so the markers usually survive only in texts fetched via the direct cache URL fallback, as with our romeo download:

Code
# Find the start and end marker line positions
start_marker <- which(stringr::str_detect(romeo$text, "\\*\\*\\* START OF"))
end_marker   <- which(stringr::str_detect(romeo$text, "\\*\\*\\* END OF"))

cat("START marker at line:", start_marker, "\n")
START marker at line: 23 
Code
cat("END marker at line:",   end_marker, "\n")
END marker at line: 5297 
Code
# Keep only lines between the two markers
romeo_clean <- romeo |>
  dplyr::slice((start_marker + 1):(end_marker - 1)) |>
  dplyr::filter(!is.na(text))

cat("Lines after boilerplate removal:", nrow(romeo_clean),
    "(removed", nrow(romeo) - nrow(romeo_clean), ")\n")
Lines after boilerplate removal: 5273 (removed 374 )

Removing Empty Lines

Code
# Remove lines that are empty or contain only whitespace
romeo_clean <- romeo_clean |>
  dplyr::filter(stringr::str_trim(text) != "")

cat("Lines after removing empty lines:", nrow(romeo_clean), "\n")
Lines after removing empty lines: 4137 

Collapsing to a Single String

Code
# Join all lines into one continuous string, then normalise whitespace
romeo_text <- romeo_clean$text |>
  paste(collapse = " ") |>
  stringr::str_squish()

cat("Total characters:", nchar(romeo_text), "\n")
Total characters: 141104 
Code
cat("First 300 characters:\n", substr(romeo_text, 1, 300), "\n")
First 300 characters:
 THE TRAGEDY OF ROMEO AND JULIET by William Shakespeare Contents THE PROLOGUE. ACT I Scene I. A public place. Scene II. A Street. Scene III. Room in Capulet’s House. Scene IV. A Street. Scene V. A Hall in Capulet’s House. ACT II CHORUS. Scene I. An open place adjoining Capulet’s Garden. Scene II. Cap 

Splitting into Acts and Scenes

Code
# Split Romeo and Juliet into Acts using a regex on Roman numeral headings
acts <- romeo_text |>
  stringr::str_replace_all("(ACT [IVX]+\\.?)", "|||\\1") |>  # insert split marker
  stringr::str_split("\\|\\|\\|") |>
  unlist() |>
  (\(x) x[nchar(stringr::str_trim(x)) > 20])()   # drop very short fragments

cat("Segments found:", length(acts), "\n")
Segments found: 11 
Code
cat("Segment 2 begins:", substr(acts[2], 1, 120), "\n")
Segment 2 begins: ACT I Scene I. A public place. Scene II. A Street. Scene III. Room in Capulet’s House. Scene IV. A Street. Scene V. A Ha 

Eleven segments are found rather than five because the regex also matches the ACT headings listed in the table of contents (segment 2 above is the Contents listing, not the play text). For a clean five-act split, remove the Contents block first or discard the extra leading segments.

Splitting into Chapters

Code
# Clean Wuthering Heights from the bronte_texts corpus
wuthering <- bronte_texts |>
  dplyr::filter(stringr::str_detect(title, "Wuthering"))

# Diagnostic: check what the opening lines look like
# (useful for seeing the exact marker format used)
cat("First 5 lines:\n")
First 5 lines:
Code
cat(head(wuthering$text, 5), sep = "\n")
Wuthering Heights

by Emily Brontë
Code
# Find boilerplate markers — try several common variants
wh_start <- which(stringr::str_detect(
  wuthering$text,
  stringr::regex("\\*{3}\\s*START OF", ignore_case = TRUE)
))
wh_end <- which(stringr::str_detect(
  wuthering$text,
  stringr::regex("\\*{3}\\s*END OF", ignore_case = TRUE)
))

# If markers not found, use the full text with no trimming
if (length(wh_start) == 0) {
  cat("START marker not found — using full text\n")
  wh_start <- 0L
}
START marker not found — using full text
Code
if (length(wh_end) == 0) {
  cat("END marker not found — using full text\n")
  wh_end <- nrow(wuthering) + 1L
}
END marker not found — using full text
Code
# Slice between markers (or use full text if markers absent)
wh_text <- wuthering |>
  dplyr::slice((wh_start[1] + 1):(wh_end[1] - 1)) |>
  dplyr::filter(stringr::str_trim(text) != "") |>
  dplyr::pull(text) |>
  paste(collapse = " ") |>
  stringr::str_squish()

cat("Characters in cleaned text:", nchar(wh_text), "\n")
Characters in cleaned text: 643482 
Code
# Split on CHAPTER headings (Roman or Arabic numerals)
wh_chapters <- wh_text |>
  stringr::str_replace_all("(CHAPTER\\s+[IVXLCDM0-9]+\\.?)", "|||\\1") |>
  stringr::str_split("\\|\\|\\|") |>
  unlist() |>
  (\(x) x[nchar(stringr::str_trim(x)) > 50])()

cat("Chapters found:", length(wh_chapters), "\n")
Chapters found: 34 
Code
cat("Second segment begins:", substr(wh_chapters[2], 1, 150), "\n")
Second segment begins: CHAPTER II Yesterday afternoon set in misty and cold. I had half a mind to spend it by my study fire, instead of wading through heath and mud to Wuthe 

Saving Cleaned Texts

Save downloaded and cleaned data immediately to avoid re-downloading in future sessions:

Code
# Create data directory if needed
if (!dir.exists(here::here("data"))) {
  dir.create(here::here("data"), recursive = TRUE)
}

# Save as RDS (R's native binary format — fast and lossless)
saveRDS(romeo_text,        here::here("data", "romeo_clean.rds"))
saveRDS(wh_chapters,       here::here("data", "wh_chapters.rds"))
saveRDS(comparison_corpus, here::here("data", "comparison_corpus.rds"))

# Save as plain text for use outside R
writeLines(romeo_text, here::here("data", "romeo_clean.txt"))

cat("Saved to:", here::here("data"), "\n")
Code
# Load saved data in future sessions — no re-downloading needed
romeo_text     <- readRDS(here::here("data", "romeo_clean.rds"))
wh_chapters    <- readRDS(here::here("data", "wh_chapters.rds"))
comparison_corpus <- readRDS(here::here("data", "comparison_corpus.rds"))
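The save/load pair above can be folded into a simple caching pattern: read the saved object if it exists, otherwise build it and save it. This is a sketch; `load_or_make()` is a hypothetical helper, not part of gutenbergr or here.

Code
```r
# Hypothetical caching helper: read an RDS if present, else build and save it
load_or_make <- function(path, make) {
  if (file.exists(path)) {
    readRDS(path)
  } else {
    obj <- make()
    saveRDS(obj, path)
    obj
  }
}

# Usage sketch: the builder function is whatever produced the object originally
# romeo_text <- load_or_make(here::here("data", "romeo_clean.rds"),
#                            function() { ...download and clean... })
```

Because the builder runs only on a cache miss, re-rendering the document never triggers a fresh download once the RDS files exist.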

Troubleshooting

Section Overview

What you’ll learn: How to handle encoding issues and texts that are not found in the catalogue

Encoding Issues

Some older texts are encoded in Latin-1 rather than UTF-8. Read as UTF-8, their accented letters appear garbled (for example, é may show up as a replacement character):

Code
# Fix garbled characters by re-encoding from Latin-1 to UTF-8
# (text_df stands for a downloaded data frame with a `text` column)
text_fixed <- text_df |>
  dplyr::mutate(
    text = iconv(text, from = "latin1", to = "UTF-8", sub = "byte")
  )

# For a single string: convert it from its declared encoding to UTF-8
clean_line <- enc2utf8(text_df$text[1])
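Before re-encoding blindly, check whether a problem actually exists. Base R's `validUTF8()` flags strings whose bytes are not valid UTF-8, which is exactly what raw Latin-1 accented characters look like:

Code
```r
# One Latin-1 byte string and one proper UTF-8 string
x <- c("caf\xe9", "caf\u00e9")
validUTF8(x)        # FALSE TRUE: only the first line needs repair
sum(!validUTF8(x))  # number of lines to re-encode
iconv(x[1], from = "latin1", to = "UTF-8")
```

If the count is zero, the text is already valid UTF-8, and re-encoding it from Latin-1 would itself introduce mojibake.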

Text Not Found

If gutenberg_works() returns zero rows or gutenberg_safe() fails:

Code
# Problem 1: exact title match fails
# Solution: partial, case-insensitive search
gutenberg_works(
  stringr::str_detect(stringr::str_to_lower(title), "romeo")
)

# Problem 2: text has no downloadable plain text version
# Solution: check has_text == TRUE
gutenberg_metadata |>
  dplyr::filter(title == "Romeo and Juliet", has_text == TRUE)

# Problem 3: check rights status
gutenberg_metadata |>
  dplyr::filter(title == "Romeo and Juliet") |>
  dplyr::select(gutenberg_id, title, rights, has_text)
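If partial matching still returns nothing, the catalogue title may simply be spelled slightly differently. Base R's `agrepl()` allows approximate matching within a bounded edit distance; the sketch below uses a toy title vector, but the same call works on `gutenberg_metadata$title`:

Code
```r
# Approximate title match: up to 2 insertions/deletions/substitutions
titles <- c("Wuthering Heights", "Romeo and Juliet", "Jane Eyre")
agrepl("Wuthering Hights", titles, max.distance = 2)  # TRUE FALSE FALSE
```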

Verifying a Download

Code
# Reusable function to quickly check a downloaded text data frame
verify_download <- function(text_df, min_lines = 100) {
  cat("--- Download Verification ---\n")
  cat("Rows:", nrow(text_df), "\n")
  cat("Columns:", paste(names(text_df), collapse = ", "), "\n")
  cat("Empty lines:", sum(is.na(text_df$text) | text_df$text == ""), "\n")
  if ("title"  %in% names(text_df)) cat("Title:",  unique(text_df$title),  "\n")
  if ("author" %in% names(text_df)) cat("Author:", unique(text_df$author), "\n")
  if (nrow(text_df) < min_lines) warning("Download seems very short — check for errors")
  cat("First non-empty line:",
      text_df$text[which(nzchar(text_df$text))[1]], "\n")
}

verify_download(romeo, min_lines = 500)
--- Download Verification ---
Rows: 5647 
Columns: gutenberg_id, text, title 
Empty lines: 1194 
Title: Romeo and Juliet 
First non-empty line: The Project Gutenberg eBook of Romeo and Juliet 
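Note that the first non-empty line in the output above is still the Project Gutenberg header, a sign that boilerplate removal is still pending. A small check like the following (a sketch; `has_boilerplate()` is a hypothetical helper) can flag that automatically:

Code
```r
# TRUE if the opening lines still contain Project Gutenberg boilerplate
has_boilerplate <- function(lines) {
  head_lines <- lines[seq_len(min(20, length(lines)))]
  any(grepl("Project Gutenberg", head_lines, fixed = TRUE))
}

has_boilerplate(c("The Project Gutenberg eBook of Romeo and Juliet", "..."))
```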

AI Statement

This how-to guide was substantially revised and expanded from the original LADAL draft (gutenberg.qmd) with the assistance of Claude (Anthropic), an AI language model. The AI was used to: restructure the guide into a logical sequence of sections; add the gutenberg_safe() helper function (applying the mirror-loop + direct-URL fallback pattern consistently across all download calls, replacing the original gutenberg_download() pipe approach that returned empty results); expand filtering coverage to include bookshelf filtering, multi-criteria filtering, and partial name matching; add the cleaning and preparation section (boilerplate removal, splitting into acts/chapters, saving/loading); add the troubleshooting section; add the language frequency bar plot and metadata fields table; convert all formatting to Quarto callouts and LADAL flextable style; and update the YAML and citation. All content and workflow decisions were reviewed by the tutorial author.


Citation & Session Info

Schweinberger, Martin. 2026. Downloading Texts from Project Gutenberg using R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/gutenberg/gutenberg.html (Version 2026.02.24).

@manual{schweinberger2026gb,
  author       = {Schweinberger, Martin},
  title        = {Downloading Texts from Project Gutenberg using R},
  note         = {https://ladal.edu.au/tutorials/gutenberg/gutenberg.html},
  year         = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address      = {Brisbane},
  edition      = {2026.02.24}
}
Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] flextable_0.9.7  DT_0.33          gutenbergr_0.2.4 lubridate_1.9.4 
 [5] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.0.4     
 [9] readr_2.1.5      tidyr_1.3.1      tibble_3.2.1     ggplot2_3.5.1   
[13] tidyverse_2.0.0 

loaded via a namespace (and not attached):
 [1] gtable_0.3.6            xfun_0.51               bslib_0.9.0            
 [4] htmlwidgets_1.6.4       tzdb_0.4.0              vctrs_0.6.5            
 [7] tools_4.4.2             crosstalk_1.2.1         generics_0.1.3         
[10] curl_6.2.1              parallel_4.4.2          klippy_0.0.0.9500      
[13] pkgconfig_2.0.3         data.table_1.17.0       assertthat_0.2.1       
[16] uuid_1.2-1              lifecycle_1.0.4         compiler_4.4.2         
[19] textshaping_1.0.0       munsell_0.5.1           codetools_0.2-20       
[22] fontquiver_0.2.1        fontLiberation_0.1.0    htmltools_0.5.8.1      
[25] sass_0.4.9              lazyeval_0.2.2          yaml_2.3.10            
[28] crayon_1.5.3            pillar_1.10.1           jquerylib_0.1.4        
[31] openssl_2.3.2           cachem_1.1.0            fontBitstreamVera_0.1.1
[34] tidyselect_1.2.1        zip_2.3.2               digest_0.6.37          
[37] stringi_1.8.4           fastmap_1.2.0           grid_4.4.2             
[40] colorspace_2.1-1        cli_3.6.4               magrittr_2.0.3         
[43] triebeard_0.4.1         utf8_1.2.4              withr_3.0.2            
[46] gdtools_0.4.1           scales_1.3.0            bit64_4.6.0-1          
[49] timechange_0.3.0        rmarkdown_2.29          officer_0.6.7          
[52] bit_4.5.0.1             askpass_1.2.1           ragg_1.3.3             
[55] hms_1.1.3               evaluate_1.0.3          knitr_1.49             
[58] urltools_1.7.3          rlang_1.1.5             Rcpp_1.0.14            
[61] glue_1.8.0              xml2_1.3.6              renv_1.1.1             
[64] vroom_1.6.5             rstudioapi_0.17.1       jsonlite_1.9.0         
[67] R6_2.6.1                systemfonts_1.2.1      
